Shapley values quantify the contribution of each feature to a model's prediction; they generalize the concept of a player's marginal contribution to a coalition. In this homework, we will use the shap and dalex packages to compute Shapley values for different models trained on the Heart Attack dataset. We will compare the results of the two packages and discuss the differences. We will also interpret the Shapley values and discuss variable importance. Furthermore, we will compare the Shapley values calculated for two different models: XGBoost and logistic regression. Finally, we will investigate the model predictions for selected observations and discuss their Shapley values in detail.
The data consists of 13 features and 1 label column (output); its dimensions are (303, 14). The features are:
Age : Age of the patient in years - continuous
Sex : Sex of the patient (1 = male, 0 = female) - categorical
cp : chest pain type - categorical
Value 0: typical angina
Value 1: atypical angina
Value 2: non-anginal pain
Value 3: asymptomatic
trtbps : resting blood pressure (in mm Hg) - continuous
chol : cholesterol in mg/dl fetched via BMI sensor - continuous
fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) - categorical
rest_ecg : resting electrocardiographic results - categorical
Value 0: normal
Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
thalachh : maximum heart rate achieved - continuous
exng: exercise induced angina (1 = yes; 0 = no) - categorical
oldpeak: ST depression induced by exercise relative to rest - continuous
slp : the slope of the peak exercise ST segment - categorical
Value 0: upsloping
Value 1: flat
Value 2: downsloping
Value 3: no information
caa: number of major vessels (0-4) - categorical
thall : thallium stress test result - categorical
Value 0: no information
Value 1: fixed defect
Value 2: normal
Value 3: reversible defect
output : 0 = no heart disease, 1 = heart disease - categorical
The categorical features are one-hot encoded because they are not ordinal. The encoding is done with the pandas get_dummies function.
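A minimal sketch of this step, using a hypothetical mini-frame (the column names mirror the dataset; the values are made up for illustration):

```python
import pandas as pd

# Hypothetical rows with the dataset's column names; values are illustrative.
df = pd.DataFrame({
    "cp": [0, 2, 1],          # chest pain type (nominal)
    "thall": [2, 1, 3],       # thallium stress test result (nominal)
    "chol": [222, 268, 199],  # continuous, passed through unchanged
})

# One-hot encode only the non-ordinal columns.
encoded = pd.get_dummies(df, columns=["cp", "thall"])
print(sorted(encoded.columns))
# ['chol', 'cp_0', 'cp_1', 'cp_2', 'thall_1', 'thall_2', 'thall_3']
```

Each nominal column is expanded into one indicator column per observed level (e.g. cp_0, cp_1, cp_2), while continuous columns such as chol are left untouched.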
Before creating a model, I will split the data into train and test sets using train_test_split from sklearn.model_selection, with 80% of the data for training and 20% for testing, and random_state=42 to make the results reproducible.
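The split can be sketched as follows (toy lists stand in for the real feature matrix and labels):

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix and label vector.
X = [[i] for i in range(10)]
y = [0, 1] * 5

# 80/20 split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 8 2
```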
XGBoost is a gradient boosting framework that uses a tree-based learning algorithm. It is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way, and the same code runs on major distributed environments (Hadoop, SGE, MPI) and can scale to billions of examples.
As we can see from the above table, the XGBoost model achieved very good performance.
The model achieved a recall of ~0.81, meaning that out of 100 patients who actually have heart disease, about 81 are predicted by the model to have heart disease.
The model achieved a precision of ~0.89, meaning that out of 100 patients predicted to have heart disease, about 89 actually have it.
The model achieved an accuracy of ~0.86, meaning that out of 100 patients, about 86 are classified correctly.
The F1 score of the model is ~0.85; as the harmonic mean of precision and recall, it confirms that the model is good at predicting heart disease.
The AUC of the model is ~0.92, meaning that the model is good at distinguishing between patients with and without heart disease.
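The scores above can be computed with sklearn.metrics; this is a sketch on tiny made-up labels, not the homework's actual predictions:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Made-up labels and predictions, only to show the calls.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.1]  # predicted probability of class 1

print(recall_score(y_true, y_pred))     # 2 of 3 true positives found -> ~0.67
print(precision_score(y_true, y_pred))  # both positive calls correct -> 1.0
print(accuracy_score(y_true, y_pred))   # 4 of 5 correct -> 0.8
print(f1_score(y_true, y_pred))         # harmonic mean -> 0.8
print(roc_auc_score(y_true, y_prob))    # perfect ranking here -> 1.0
```

Note that AUC is computed from the predicted probabilities (the ranking), while the other four metrics use the hard 0/1 predictions.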
I will create a new model on the whole dataset and use it to predict heart disease for 2 selected patients. I will also explain the model's predictions using SHAP values.
The first patient, number 56, is a male aged 48 years, with chest pain type 0 (typical angina), resting blood pressure of 122 mm Hg, cholesterol of 222 mg/dl, fasting blood sugar of 0 (false), resting electrocardiographic results of 0 (normal), maximum heart rate achieved of 186, exercise-induced angina of 0 (no), ST depression induced by exercise relative to rest of 0.0, peak exercise ST segment slope of 2, 0 major vessels, and thalassemia of 2 (normal).
The model predicts a ~0.99 probability that the patient has heart disease. The prediction is correct: the patient does have heart disease.
The second patient, number 167, is a female aged 62 years, with chest pain type 0 (typical angina), resting blood pressure of 140 mm Hg, cholesterol of 268 mg/dl, fasting blood sugar of 0 (false), resting electrocardiographic results of 0 (normal), maximum heart rate achieved of 160, exercise-induced angina of 0 (no), ST depression induced by exercise relative to rest of 3, peak exercise ST segment slope of 0, 2 major vessels, and thalassemia of 2 (normal).
The model predicts a ~0.019 probability that the patient has heart disease. The prediction is correct: the patient does not have heart disease.
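Retrieving a single patient's predicted probability can be sketched like this. The sketch uses a scikit-learn logistic regression on made-up data as a stand-in for the XGBoost model (XGBClassifier exposes the same predict_proba interface), and the feature values are illustrative only:

```python
from sklearn.linear_model import LogisticRegression

# Made-up [age, trtbps] rows and labels, standing in for the real dataset.
X = [[48, 122], [62, 140], [55, 130], [40, 118], [58, 150], [45, 120]]
y = [1, 0, 1, 0, 0, 1]

model = LogisticRegression().fit(X, y)

# predict_proba returns one row per observation: [P(class 0), P(class 1)].
p = model.predict_proba([X[0]])[0, 1]
print(f"predicted probability of heart disease: {p:.3f}")
```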
Based on the following plot, we can make these interpretations:
1) On average, the model predicts a ~0.545 probability of heart disease (~55%) across the dataset.
2) The most important features for the model's prediction about whether patient 56 has heart disease are:
1) caa_0 = 1 - number of major vessels equal to 0 - it increased the predicted probability of heart disease by ~0.241 (from ~0.545 to ~0.786).
2) thall_2 = 1 - thalassemia of 2 (normal) - it increased the probability by ~0.101 (from ~0.545 to ~0.646).
3) cp_0 = 1 - chest pain type 0 (typical angina) - it decreased the probability by ~0.058 (from ~0.545 to ~0.487).
4) oldpeak = 0 - ST depression induced by exercise relative to rest of 0 - it increased the probability by ~0.037 (from ~0.545 to ~0.582).
5) chol = 222 - cholesterol in mg/dl - it increased the probability by ~0.073 (from ~0.545 to ~0.618).
6) thall_3 = 0 - absence of thalassemia of 3 (reversible defect) - it increased the probability by ~0.044 (from ~0.545 to ~0.589).
7) thalachh = 186 - maximum heart rate achieved - it increased the probability by ~0.015 (from ~0.545 to ~0.560).
8) sex_0 = 0 - being male - it decreased the probability by ~0.030 (from ~0.545 to ~0.515).
9) exng_0 = 1 - no exercise-induced angina - it increased the probability by ~0.013 (from ~0.545 to ~0.558).
Based on the following plot, we can make these interpretations:
1) On average, the model predicts a ~0.545 probability of heart disease (~55%) across the dataset.
2) The most important features for the model's prediction about whether patient 167 has heart disease are:
1) caa_0 = 0 - more than 0 major vessels - it decreased the predicted probability of heart disease by ~0.269 (from ~0.545 to ~0.276).
2) cp_0 = 1 - chest pain type 0 (typical angina) - it decreased the probability by ~0.129 (from ~0.545 to ~0.416).
3) oldpeak = 3.6 - ST depression induced by exercise relative to rest of 3.6 - it decreased the probability by ~0.088 (from ~0.545 to ~0.457).
4) age = 62 - age in years - it decreased the probability by ~0.151 (from ~0.545 to ~0.394).
5) sex_0 = 1 - being female - it increased the probability by ~0.072 (from ~0.545 to ~0.617).
6) slp_1 = 0 - peak exercise ST segment slope not flat - it increased the probability by ~0.023 (from ~0.545 to ~0.568).
7) exng_0 = 1 - no exercise-induced angina - it increased the probability by ~0.018 (from ~0.545 to ~0.563).
The effect of each variable on the predicted probability of heart disease differs between patients, because SHAP values are local. However, for both patients the most important feature for the model's prediction is the number of major vessels (caa).
Now I will use the shap package to calculate the Shapley values for the same 2 patients.
Overall, the results from both packages are similar in terms of the direction of the marginal effect of each feature. However, the magnitudes differ, because dalex and shap use different algorithms to estimate the Shapley values: the TreeSHAP algorithm in shap is exact and deterministic for tree models, while the sampling-based algorithm in dalex is stochastic, so its estimates are approximate (their accuracy depends on the number of sampled orderings) and can be slower to compute.
Based on the following plot, we can make these interpretations:
1) On average, the model predicts a ~0.545 probability of heart disease (~55%) across the dataset.
2) The most important features for the model's prediction about whether patient 56 has heart disease are:
1) caa_0 = 1 - number of major vessels equal to 0 - it increased the predicted probability of heart disease by ~0.28 (from ~0.545 to ~0.825).
2) etc.
To further demonstrate the locality of the SHAP values, I will try to find two observations in the dataset that have different variables of highest importance.
Based on the code, patients 1 and 5 have different variables of highest importance. The most important feature for the model's prediction about patient 1 is cp_0 = 0 (chest pain type other than 0, i.e. not typical angina), while the most important feature for patient 5 is oldpeak = 0.4 (ST depression induced by exercise relative to rest of 0.4).
I will try to select one variable X and find two observations in the dataset such that for one observation, X has a positive attribution, and for the other observation, X has a negative attribution.
As variable X I will choose age.
The first observation would be patient 1, for whom the feature age has a positive attribution.
As we can see from the plot below, the feature age has a positive attribution for patient 1, while it has a negative attribution for patient 5.
I will train another model (logistic regression) and find an observation for which the SHAP attributions differ between this model and XGBoost.
As we can see from the plots below, the two different models (XGBoost and logistic regression) have different attributions for the same observation (patient 56). For instance, the variable thalachh has an attribution of ~0.1 for the logistic regression model but ~0.015 for the XGBoost model, a difference of ~0.085. Generally, however, the attributions are similar for both models.
We will now look closely at how to calculate Shapley values by hand. Consider a game with 3 players and the following payoffs:
v(∅) = 0
v(A) = 20
v(B) = 20
v(C) = 60
v(A,B) = 60
v(A,C) = 70
v(B,C) = 70
v(A,B,C) = 100
Our goal is to calculate the Shapley value of player A.
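The Shapley value of A is the average of A's marginal contribution over all 3! = 6 orderings of the players. A short script using the payoffs above carries out the calculation:

```python
from itertools import permutations

# Payoff function v from the text.
v = {
    frozenset(): 0,
    frozenset("A"): 20, frozenset("B"): 20, frozenset("C"): 60,
    frozenset("AB"): 60, frozenset("AC"): 70, frozenset("BC"): 70,
    frozenset("ABC"): 100,
}

def shapley(player, players=("A", "B", "C")):
    """Average the player's marginal contribution over all orderings."""
    orders = list(permutations(players))
    total = 0
    for order in orders:
        before = frozenset(order[: order.index(player)])
        total += v[before | {player}] - v[before]
    return total / len(orders)

# A's marginal contribution by ordering:
#   ABC: v(A)-v(∅)=20     ACB: v(A)-v(∅)=20     BAC: v(AB)-v(B)=40
#   BCA: v(ABC)-v(BC)=30  CAB: v(AC)-v(C)=10    CBA: v(ABC)-v(BC)=30
print(shapley("A"))  # (20+20+40+30+10+30)/6 = 25.0
```

The same function gives 25 for B and 50 for C, and the three values sum to v(A,B,C) = 100, as the efficiency property of Shapley values requires.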

In this notebook we have seen how to use the dalex and shap packages to explain the predictions of a model. We compared the results from both packages and found that they are similar in the direction of the marginal effect of each feature, though the magnitudes differ slightly.
We also successfully found observations for which the most important SHAP attribution differs, as well as two observations for which the same variable has attributions of opposite sign. The lack of consistency shown by these examples does not necessarily mean that the method is unreliable; it is a reminder that we should be careful when interpreting the results and take the interactions between variables into consideration.
We also compared two models (XGBoost and logistic regression) and found that they give different attributions for the same observation (patient 56), although generally the attributions are similar for both models.